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Summary. The paper proposes a Riemannian Manifold Hamiltonian Monte Carlo sampler 
to resolve the shortcomings of existing Monte Carlo algorithms when sampling from target 
densities that may be high dimensional and exhibit strong correlations. The method provides 
a fully automated adaptation mechanism that circumvents the costly pilot runs required to 
tune proposal densities for Metropolis-Hastings or indeed Hybrid Monte Carlo and Metropo- 
lis Adjusted Langevin Algorithms. This allows for highly efficient sampling even in very high 
dimensions where different scalings may be required for the transient and stationary phases 
of the Markov chain. The proposed method exploits the Riemannian structure of the param- 
eter space of statistical models and thus automatically adapts to the local manifold structure 
at each step based on the metric tensor. A semi-explicit second order symplectic integrator 
for non-separable Hamiltonians is derived for simulating paths across this manifold which pro- 
vides highly efficient convergence and exploration of the target density. The performance of 
the Riemannian Manifold Hamiltonian Monte Carlo method is assessed by performing posterior 
inference on logistic regression models, log-Gaussian Cox point processes, stochastic volatil- 
ity models, and Bayesian estimation of parameter posteriors of dynamical systems described 
by nonlinear differential equations. Substantial improvements in the time normalised Effective 
Sample Size are reported when compared to alternative sampling approaches. Matlab code 

at http : / /www. dcs . gla . ac.uk/inference/rmhmc allows replication of all results. 

1. Introduction 

For an unnormalised probability density function, p(0) where 6 M. D , the normalised density 
follows as p(0) = p(6)/ J p(0)d0, which for many statistical models is analytically intractable. 
Monte Carlo estimates of integrals with respect to p(9), which commonly appear in Bayesian statis- 
tics, are therefore required (Gilks et ai, 1996). The predominant methodology for sampling from 
such a probability density is Markov chain Monte Carlo (MCMC) see e.g. (Robert, 2004; Gelman 
et ai, 2004; Gilks et ai, 1996; Liu, 2001). The most general algorithm defining a Markov process 
with invariant density p(0) is the Metropolis-Hastings algorithm (Metropolis et ai, 1953; Hasting, 
1970), which is arguably one of the most successful and influential Monte Carlo algorithms (Beichl 
and Sullivan, 2000) . 

The Metropolis-Hastings algorithm proposes transitions 0^0* with density q(0*\0), which 
are then accepted with probability a(0, 0*) = mm{l,p(0*)q(0\0*)/p(0)q(0*\0)} . This accep- 
tance probability ensures that the Markov chain is reversible with respect to the stationary target 
density p(0) and satisfies detailed balance, see for example Robert (2004); Neal (1993a, 1996); Liu 
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(2001). Typically, the proposal distribution q(0*\0) which drives the Markov chain takes the form 
of a random walk, e.g. q(0* \0) = N{0* \0, A) is a D-dimensional Normal distribution with mean 
and covariance A. 

High acceptance rates can be achieved by proposing smaller transitions, however larger amounts 
of time will then be required to make long traversals of parameter space. In high dimensions, when 
D is large, the random walk can become inefficient resulting in low acceptance rates, poor mixing 
of the chain and highly correlated samples. A consequence of this is a small effective sample size 
(ESS) from the chain, see Robert (2004); Gilks et al. (1996); Neal (1996); Liu (2001). Whilst there 
have been a number of suggestions to overcome this inefficiency, guaranteeing detailed balance 
and ergodicity of the chain places constraints on what can be achieved in alleviating this problem 
(Andrieu and Thorns, 2008; Robert, 2004; Neal, 1993a). Design of a good general purpose proposal 
mechanism providing large proposal transitions that are accepted with high probability remains 
something of an engineering art-form. 

Major steps forward in this regard were made when a proposal process derived from a discre- 
tised Langevin diffusion with a deterministic component based on the gradient information of the 
target density was suggested in the Metropolis Adjusted Langevin Algorithm (MALA) (Roberts and 
Stramer, 2003). Likewise the Hybrid Monte Carlo (HMC) method (Duane et al, 1987) was pro- 
posed in the statistical physics literature as a means of efficiently simulating states from a physical 
system which was then applied to problems of statistical inference (Neal, 1993a,b, 1996; Liu, 2001). 
In HMC, a deterministic proposal process is employed along with additional stochastic proposals 
that together provide an ergodic Markov chain capable of making large transitions that are accepted 
with high probability. Given the potential efficiency gains to be obtained in MCMC sampling from 
such proposal mechanisms a brief review of HMC within the context of statistical inference is pro- 
vided in the following section. In Section 3, a generalisation of HMC is presented, which takes 
advantage of the natural Riemannian structure of the parameter space and allows for more efficient 
proposal transitions to be made. Finally, in Sections 4 and 5, this new methodology is demonstrated 
on a number of interesting statistical problems, i.e. Bayesian logistic regression, stochastic volatility 
modeling, log-Gaussian Cox point processes, and parameter inference in dynamical systems. 

2. Hybrid Monte Carlo 

Consider the random variable £ M. D with density p(0) and an independent auxiliary variable 
p £ M. D with density p(p) = Af(p\0, M). Thejoint density follows in factorised form as p(0,p) = 
p(0)p(p) = p{0)N '(p|0,M). Denoting the log of the desired density as C{0) = \ogp(0), the 
negative joint log-likelihood is 

H(0, p) = -C(0) + \ log(27r) D |M| + ^p T M *p (1) 

The physical analogy of this negative joint log-likelihood is a Hamiltonian (Duane et al, 1987; 
Leimkuhler and Reich, 2004), which describes the sum of a potential energy function —C(0) defined 
at the position 0, and a kinetic energy term p T M _1 p/2. The auxiliary variable p is interpreted as 
a momentum variable and the covariance matrix M denotes a mass matrix. 

The score function (Schervish, 1995), with respect to and p, of the log joint density over the 
two random variables has a physical interpretation as the time evolution, with respect to a fictitious 
time r, of the physical system as given by Hamilton's equations, 
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These deterministic equations can be exploited in defining a proposal process for both sets of 
random variables by firstly drawing a sample of p from Af(p\0, M), then numerically integrating 
equation (2) to provide the evolution process in the joint space. If the numerical integrator is such 
that it provides mappings (<?, p) i— > (0* , p*) that are both time-reversible and volume preserving, 
then the use of the Hastings ratio in defining an acceptance probability min[l, oxp{— H(0*, p*) + 
H(0, p)}] produces an ergodic, time reversible Markov chain that satisfies detailed balance and 
whose stationary density is p(6, p) (Duane et al, 1987; Liu, 2001; Neal, 1996). 

The class of explicit symplectic numerical integrators are both time reversible and volume pre- 
serving (Leimkuhler and Reich, 2004) and as such would be appropriate in devising the desired 
Markov chain. The Leapfrog algorithm was introduced as a symplectic integrator in the original 
paper of Duane et al. (1987), and employed in statistical applications e.g. (Liu, 2001; Neal, 1993b) 
as described below, 



Since the joint likelihood is factorisable (i.e. in physical terms, the Hamiltonian is separable), it 
is obvious by inspection that each complete Leapfrog step (equations (3), (4) and (5)) is reversible 
by the negation of the integration step-size, e. Likewise as the Jacobians of the transformations 
(0,p) i— > (0, p + eVe£(0)/2) and (0,p) i— > (0 + eM _1 p, p) have unit determinant then volume is 
preserved, and thus detailed balance will be satisfied in an HMC scheme that employs an acceptance 
ratio min [1, exp{-H(6*, p*) + H(0, p)}]. Random values of p — W(p|0, M) are used prior to 
each deterministic sequence of Leapfrog steps to ensure the full space is explored, and consequently 
the ergodicity of the chain is preserved, see Neal (1996); Liu (2001) for a detailed description of the 
HMC procedure. 

It should be noted that the combination of equations (3) and (4) in a single step of the Leapfrog 
algorithm yields an update of the form 



which is nothing more than a discrete pre-conditioned Langevin diffusion as employed in MALA 
(Roberts and Stramer, 2003) (see Neal (1993a, 1996) for further discussion on this point). 

The ability of HMC to overcome random walks in MCMC sampling suggests it should be a 
highly successful tool for Bayesian inference. A study suggests in excess of 300 papers cite the 
original (Duane et al, 1987) paper within the literature devoted to Molecular Modelling and Sim- 
ulation, Physics and Chemistry. However there are a much smaller number of citations in the lit- 
erature devoted to Statistical Methodology and Application, e.g. (Liu, 2001; Neal, 1996, 1993b; 
Gustafson, 1997; Ishwaran, 1999), indicating that it may have largely passed into desuetude. 

Whilst the choice of the step size e and number of Leapfrog steps can be tuned based on the over- 
all acceptance rate of the HMC sampler, it is unclear how to select the values of the weight matrix 
M in any automated manner that does not require some knowledge of the target density. Although 
rules of thumb are suggested (Liu, 2001; Neal, 1993a, 1996) these typically rely on knowledge of 
the marginal variance of the target density, which is of course not known at the time of simulation 
and thus requires preliminary pilot runs of HMC, this is also the case for MALA although asymp- 
totic settings are suggested in Christensen et al. (2005). The experimental sections of this paper will 
demonstrate how crucial this tuning is to obtain acceptable performance of HMC and MALA. 



p(r + e/2) 
0(t + e) 
P(r + e) 



p(T)+eV e £(0(r))/2 

9{t) + eM-^r + e/2) 

p(r + e/2) + eV e £(6/(r + e))/2 



(3) 
(4) 
(5) 



0(t + e) = 0(t) + yM- 1 V e £(6/(r)) + eM^p^) 



(6) 
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The potential of the HMC methodology may be more fully realised by employing stochastic 
transitions that take into account the local structure of the target density when proposing moves to 
different likelihood regions, as this may improve the overall mixing of the chain. Therefore rather 
than employing a fixed global covariance matrix in the proposal density Af(p\0, M), a position 
specific covariance would be adopted. Furthermore, the deterministic proposal mechanism of HMC, 
when viewed as the deterministic component of the discrete pre-conditioned Langevin diffusion, 
equation (6), relies on the likelihood gradient pre-conditioned by the inverse of a globally constant 
metric tensor i.e. a mass matrix. However, given the Riemannian structure of the parameter space 
of statistical models (Amari, 1990; Kass, 1989) the adoption of the position specific metric tensor 
should yield more effective deterministic transitions in the overall algorithm. The following section 
now formalises both of the above considerations by defining the overall Hamiltonian on the Riemann 
Manifold. 

3. Riemann Manifold Hamiltonian Monte Carlo 

The parameter space of a statistical model possesses a Riemannian structure (Amari, 1990, 1997; 
Kass, 1989; Murray and Rice, 1993), whose invariant metric is defined by the Fisher Information 
(Rao, 1945; Amari, 1990, 1997). Therefore the natural geometric structure of the density model 
p(6) is defined by the Riemannian manifold and associated metric tensor. Zlochin and Baram 
(2001) originally attempted to exploit this manifold structure, however their use of non-symplectic 
numerical integration prevented them from developing an overall HMC procedure and resulted in 
an approximate method of simulation instead of a proper MCMC algorithm, drastically limiting its 
applicability and usefulness. We show how the Riemannian manifold structure may be exploited 
within a correct MCMC framework. This overcomes the difficulties of implementing HMC and 
yields an automated means of tuning the overall method. We begin by considering the definition of 
Hamiltonian dynamics on the Riemannian manifold (Chavel, 1993). 

As the Hamiltonian is — \ogp(6, p) = —C{9) — C(p) the parameterised family of probability 
densities p(0) is a D-dimensional Riemann manifold with metric tensor, G(9), defined by the 
non-degenerate Fisher Information matrix £{V e £(0)Ve£(0) T }, (Rao, 1945; Amari, 1990, 1997). 
From equation (2), it follows that p = M.8, so the norm of each 9 under the metric M follows 
as II^IIm = M0 = p T M *p. In a more general form as the statistical model is defined on 
a Riemannian manifold, the metric tensor defines the position specific norm such that ll^Hg^) = 

6 J G(6)6 = p T G _1 (0)p and thus the kinetic energy term can be defined via the inverse metric. 
In order to ensure that the Hamiltonian can be interpreted as a log-density, the addition of the 
normalising term for the Gaussian is required, i.e. | log(27r)" D |G(£>) |. Therefore, the Hamiltonian 
defined on the Riemann manifold follows as 

H(e,v) = m + \v 1 G(or 1 v (7) 

where <f>(0) = -C{6) + \ log(2 7 r)' D |G(0)| so that exp(-H(0, p)) = p(0,p) = p(6)p(p\6) 
and the marginal density p(9) = J exp(— _ff p))dp is the desired target density. Unlike the 
previous case for HMC this joint density is no longer factorisable and therefore the log-likelihood 
does not correspond to a separable Hamiltonian. The conditional distribution for momentum values 
given parameter values is a zero-mean Gaussian with the point specific metric tensor acting as the 
covariance matrix p(p\6) = Af (p|0, G(6>)), which in part resolves the scaling issues associated 
with HMC and MALA, as will be demonstrated in Sections 4 and 7. The following section develops 
an explicit symplectic integrator for the Riemannian Manifold Hamiltonian Monte Carlo method. 
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Fig. 1. The above contours were plotted from the stochastic volatility model investigated later in 
the paper. The latent volatilities and the parameter j3 are set to their true values, while the log-joint 
likelihood given different values of the parameters a and cf> is shown by the contour plot. The left 
hand plot shows the evolution of a Markov chain using HMC with a unit mass matrix, while the right 
hand plot shows the evolution of a chain from the same starting point using RM-HMC. Note how the 
use of the metric allows RM-HMC to converge much quicker to the target density. 




Sigma Sigma 



Fig. 2. Here we see a close-up of the Markov chain paths shown in Figure 1. It is clear that RM- 
HMC effectively normalises the gradients in each direction, whereas HMC, with a unit mass matrix, 
exhibits stronger gradients along the horizontal direction compared to the vertical direction, and 
therefore takes longer to explore the space fully. A carefully tuned mass matrix may improve HMC 
sampling, while RM-HMC deals with this automatically. 
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3. 1. Symplectic Integration of a Non-Separable Hamiltonian on a Riemann Manifold 
The dynamics of the non-separable Hamiltonian follow as 



dr 
dpi 
dr 



(GfC'p), 



dH 

dpi 

dH _ dC(9) 



(8) 



do* 



2 



G(0) 



86; 



L P (9) 



The Hamiltonian dynamics on the manifold are simulated by solving the continuous time deriva- 
tives and it is straightforward to see that they satisfy Liouville's theorem of volume preserva- 
tion (Leimkuhler and Reich, 2004). However, for the discrete integrator it is not so straight- 
forward. Naively employing the discrete Leapfrog integrator (equations (3), (4) and (5)), as in 
(Zlochin and Baram, 2001), gives transformations of the form (0,p) i— ► (0,p — eip(0,p)) and 
(0, p) i ► {9 -\- ecj)(9, p), p), neither of which admits a Jacobian with unit determinant. In addi- 
tion, it is straightforward to see that reversibility for and p is not satisfied for finite step-size e, 
as G(0(t)) ^ G(0(r + e)) and p(r) T F(0)p(r) ^ p(r + e) T F(0)p(T + e). Therefore proposals 
generated from this integrator will not satisfy detailed balance in a Hybrid Monte Carlo scheme. 
What is required is a symplectic numerical integrator for solving this non-separable Hamiltonian to 
ensure a correct MCMC algorithm. Fully implicit integrators are available, however these require 
further numerical solution of the associated implicitly defined equations. A general purpose non- 
implicit symplectic integrator for non-separable Hamiltonians with position specific mass matrices, 
as defined by the Riemannian metric tensor, is desirable. We have developed such an integrator 
and the detailed derivation can be found in Appendix A. This resulting algorithm is employed in 
defining the Riemannian Manifold Hamiltonian Monte Carlo (RM-HMC) procedure below. 



3.2. The Overall RM-HMC Algorithm 

The overall proposal generating process, (po,0o) l— * (p>0)> is denoted by the vector function 
(p, 9) = f (po, 0o, e , N2) where po, 0o are the current momentum and parameter values respec- 
tively, e is the integration step size and N2 is the number of iterations of step (12) below. We denote 
by iVi the number of repeated applications of the vector function f to obtain a proposal. Scheme 1 
for the full symplectic integrator has the following five steps 



Pi = po - eV e <X0 o )/2 (10) 

p 2 = g(0 o ,pi,e/2) (11) 

9* = 0o + eG- 1 (0 o )p 2 , (12) 

P3 = g(0*,P2,e/2) (13) 

p* = p 3 -eV e 0(0*)/2 (14) 



where the vector function g(0, p, e/2) is defined in Appendix A. 2.1. We note that this function 
requires the matrix of derivatives of the metric tensor with respect to each parameter. This may be 
derived analytically for a number of applications, as shall be demonstrated shortly. The repeated 
application of the function (p*, 0*) = f(po, 0o, e, N2) provides the means to obtain a deterministic 
proposal that is guided not only by the derivative information of the target density, as in HMC 
or MALA, but also exploits the local geometric structure of the manifold as determined by the 
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metric tensor. Intuitively, comparing (4) and (12) shows that the constant mass matrix M, defining 
a globally constant metric, is now replaced with the position specific metric thus removing the 
requirement to tune the values of the elements of M, which so dramatically affects the performance 
of HMC. Since the integration scheme detailed above is symplectic, it satisfies the requirements of 
both volume preservation and reversibility. Thus employing this as a proposal process provides a 
correct MCMC scheme satisfying detailed balance and convergence to the desired target density. 
It should be noted that this symplectic integrator is a generalisation of the Leapfrog method; if the 
metric tensor is globally constant and independent of position then equations (10) to (14) reduce 
exactly to the leapfrog equations (3) to (5). Pseudo-code describing the overall RM-HMC sampling 
scheme is described below. 

Algorithm 1 Overall RM-HMC Sampling Scheme 

Require: e, N samples , N u N 2 , 9(0), p(0) 
for i = 1 to N samples do 
p~JV(O,G(0(i-l))) 

(p*(O),0*(O))«-(p,0(i-l)) 
for j = 1 to Ni do 

(p*(j), e*{j)) - f( P *(j - 1), e*(j - 1), e, n 2 ) 

end for 

(p*,0*) <- (p(N steps ),9(N steps )) 
draw a <~ Uniform(0 7 1) 

if a < min{l,exp(-H(0* ,p*) + H(0(i - l),p))}then 

e{i) <- e* 

else 

©(<) <— 0(* - 1) 
end if 
end for 



Figures 1 and 2 provide an intuitive visual demonstration of the differences in HMC and RM- 
HMC when converging to and sampling from a target density. Matlab code for RM-HMC is avail- 
able at http://www.dcs.gla.ac.uk/inference/rmhmc. To illustrate the RM-HMC 
sampling scheme and evaluate performance against alternative MCMC methods, a number of exam- 
ple applications are presented. We begin with posterior sampling for Logistic Regression models. 

4. RM-HMC for Bayesian Logistic Regression 

Consider an N x D design matrix X comprising TV samples each with D covariates and a binary 
response variable t e {0, 1}^. Denoting the logistic link function as er(-), a Bayesian logistic 
regression model of the binary response (Gelman et al, 2004; Liu, 2001) is obtained by the intro- 
duction of regression coefficients (3 € R D with an appropriate prior, which for illustrative purposes 
is given as (3 ~ Af(0, aT) where a is given. Neglecting constants, the log joint-likelihood follows 
in standard form as 

i N i 

C((3) - —(3 J (3 = /3 T X T t - lo g(! + cx P (/3 T xT .)) - — (3 J (3 (15) 

n— 1 

where X nr denotes the vector that is the n th row of the TV x D matrix X. The derivative of the 
log joint-likelihood is V£(/3) - or 1 ^ with Fisher Information £{V£(/3)V£(/3) T } = X T AX. 
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Table 1. Summary of datasets for logistic regression 


Name 


Covariates (D) 


Data Points (N) 


Dimension of f3 (b) 


Pima Indian 


7 


532 


8 


Australian Credit 


14 


690 


15 


German Credit 


24 


1000 


25 


Heart 


13 


270 


14 


Ripley 


2 


250 
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The diagonal N x N matrix A has elements A ni „ = cr(/3 T X^ .)(1 — cr(/3 T X^ .)). To capture 
prior informativeness in the metric tensor we follow (Tsutakawa, 1972; Ferreira, 1981) and sum 
the Fisher Information with the prior precision to define the overall metric tensor for the model 
as G(/3) = X T AX + and finally the derivative matrices of the metric tensor take the form 
dG(/3)/dPi = X T AV < X where the iVx N diagonal matrix V* has elements (l-2cr(f3 J X J n .))X m . 
The above identities are all that are required to define both HMC and RM-HMC sampling methods, 
which will be illustrated in the following experimental section. 

4. 1. Experimental Results for Bayesian Logistic Regression 

We present results from the analysis of 5 datasets (Michie et ah, 1994; Ripley, 1996), summarised 
in Table 1 . These datasets exhibit a wide range of characteristics which provides a challenging test 
for any applied sampling method; the number of covariates ranges from 2 to 24, the number of 
data points ranges from 250 to 1000, and the standard deviations of the induced marginal posterior 
distributions range from 0.0004 to 3. We investigate the use of RM-HMC applied to this problem 
and also implement the following sampling methods for comparison: - 

(a) Component- Wise Adaptive Metropolis-Hastings (Robert, 2004) 

(b) Joint Updating Gibbs Sampler (Holmes and Held, 2005) 

(c) Metropolis Adjusted Langevin Algorithm (Roberts and Stramer, 2003) 

(d) Hybrid Monte Carlo (Duane et ah, 1987; Neal, 1993a; Liu, 2001) 

Given each dataset we wish to sample from the posterior distribution over the regression co- 
efficients /3, and in each experiment wide Gaussian prior distributions were employed such that 
niPi) ~ ■Af(O) 100). A linear logistic regression model with intercept was used for each of the 
datasets with the exception of the Ripley dataset, for which a cubic polynomial regression model 
was employed. 

Each method was run 10 times with every dataset and the average results were recorded. We re- 
produce the results of Holmes and Held (2005) by allowing 5000 burn in iterations so that each 
sampler reaches the stationary distribution and has time to adapt as necessary. The next 5000 
iterations were used to collect posterior samples for each of the methods and the CPU time re- 
quired to collect these samples was recorded. Each method was implemented in the interpreted 
language Matlab to ensure fair comparison. We compared the relative efficiency of these meth- 
ods by calculating the effective sample size (ESS) using the posterior samples for each covariate, 
ESS = N(l + 2 J2k 7W) _1 w h ere N is the number of posterior samples and J2k 7C0 ^ s tne sum 
of the K monotone sample autocorrelations as estimated by the initial monotone sequence estimator 
(see Geyer (1992)). The standard error around the mean ESS was less than 2 x 10~ 2 for all results. 
Such an approach was also taken by Holmes and Held (2005), in which they report the mean ESS, 
averaged over each of the covariates. However, we feel this could give a rather inflated measure of 
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the true ESS, since ideally we want a measure of the number of samples which are uncorrected over 
all covariates. In this paper we therefore report the minimum ESS of the sampled covariates. This 
minimum ESS is then normalised relative to the CPU time by calculating the time taken to obtain 1 
sample which is effectively uncorrected across all covariates. 

4.2. Comparative MCMC Sampling Methods 

We employed an adaptive Metropolis-Hastings (M-H) scheme, such that each covariate was updated 
individually with its stepsize being adapted in every 100 iterations during burn-in to achieve an 
acceptance rate of between 20% and 40%. The stepsize was then fixed when sampling from the 
posterior distribution. 

The auxiliary variable Gibbs sampler of Holmes and Held (2005) was implemented with a joint 
update of {z,/3}, where z e M. N is the auxiliary variable designed to improve mixing of the co- 
variate samples. We implemented the algorithm based on the very detailed pseudo-code given in 
the appendix of their paper, and in contrast to the M-H algorithm this method has the advantage of 
requiring no tuning of parameters. The main computational expense however is in the repeated sam- 
pling from truncated normal distributions, for which we implemented code based on the efficient 
method defined in Johnson et dl. (1999). 

We implemented a MALA sampler with proposed covariates being drawn from the multivariate 
normal distribution M (/3 + V log(7r{/3})/i/2, hip), where Id is the D-dimensional identity matrix 
and h controls the scaling of the proposal variance. We follow the advice of Roberts and Rosenthal 
(1998) by scaling h like 0(D~3^ where D is the number of covariates, such that we achieve an 
acceptance rate of between 40% and 60%. 

Hybrid Monte Carlo has promised to offer more efficient sampling from high dimensional prob- 
ability distributions by effectively reducing the amount of random walk present in the parameter 
values being proposed. This has indeed been shown to be the case for relatively simple, although 
high-dimensional, multivariate normal distributions, however there has been little application to 
more complex data models. We believe the reason for this lies in the amount of tuning required to 
obtain reasonable mixing and rates of acceptance, as will be highlighted in the following section. 
The two main parameters which require tuning are the number of leapfrog steps, N\, and the size 
of each leapfrog step, e. It has been suggested that choosing the leapfrog stepsize to be inversely 
proportional to the marginal standard deviation of the target distribution along each dimension dras- 
tically improves mixing, particularly when such marginals are of greatly varying orders of magni- 
tude. Setting different leapfrog stepsizes along different directions can be equivalently encoded in 
the so-called mass matrix (Neal, 1993a, 1996). 

This approach clearly requires advance knowledge of the distribution being sampled from, and 
in a practical setting this information is very rarely available. The use of exploratory runs of a 
Metropolis sampler to obtain initial estimates of the target distribution has been suggested (Hajian, 
2007), however there is the obvious associated computational cost and the fact that this may not be 
feasible for very complex distributions. 

Following the advice of Neal (1993a, 1996), we fix the size of each leapfrog step e to a value 
slightly smaller than the smallest marginal standard deviation of the model parameter posteriors, and 
set the number of leapfrog steps N\ such that the maximum distance that can be travelled in a single 
move, eiVi, is larger than the largest standard deviation of the marginal parameter distributions. A 
larger step size would result in large rejection rates, while a smaller number of steps would result in 
very slow exploration of the target distribution. 

In our experiments we assume this information is known when implementing HMC, presum- 
ably after a number of exploratory runs of the algorithm, and set e small enough to obtain a high 
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Table 2. RM-HMC with integration scheme 1 - investigating the effect 
of parameter settings on sampling efficiency 



eNi Max e 


iVi 


N 2 


Mean Time (s) 


Min ESS 


s/Min ESS 


1 1/5 


5 


1 


385.8 


381 


1.01 


1 1/5 


5 


2 


393.2 


350 


1.12 


2 1/10 


20 


1 


1332.5 


1555 


0.86 


2 1/10 


20 


2 


1344.2 


1525 


0.88 


3 1/10 


30 


1 


1949.7 


2795 


0.70 


3 1/10 


30 


2 


1983.3 


2389 


0.83 


Table 3. RM-HMC with integration scheme 2 


- investigating the effect 


of parameter settings on sampling efficiency 






eNi Max e 


iVi 


iV 2 


Mean Time (s) 


Min ESS 


s/Min ESS 


1 1/10 


10 


1 


378.9 


278 


1.36 


1 1/10 


10 


2 


387.9 


239 


1.62 


2 1/10 


20 


1 


702.2 


774 


0.91 


2 1/10 


20 


2 


732.6 


457 


1.60 


3 1/15 


45 


1 


1567.9 


1624 


0.97 


3 1/15 


45 


2 


1580.3 


894 


1.77 



acceptance rate (> 70%) and eNi w 3 allowing the chain to traverse a distance larger than the stan- 
dard deviation of the largest marginal posterior for all datasets, see Table 9. This approach works 
well for distributions in which the marginal standard deviations are of a similar magnitude, however 
the algorithm soon becomes computationally very expensive to run in situations where they greatly 
differ and the number of leapfrog steps required for adequate mixing consequently becomes very 
large. 

4.3. Investigating RM-HMC 

We begin by investigating the RM-HMC method in detail for the most challenging of our five 
datasets, German Credit, which consists of 24 covariates and 1000 datapoints. We then compare the 
results for all five datasets employing the alternative sampling methods described in the previous 
section. 

As previously mentioned, the evolution operator of the RM-HMC method may be obtained 
to second order by splitting the non-separable Hamiltonian (see Appendix A), and Scheme 1 has 
already been presented. The alternative way of splitting the Hamiltonian as detailed in the appendix 
yields a slightly different but equally valid integration scheme, Scheme 2. The main computational 
burden of both integration schemes is incurred in calling the function p = g(0,p o , e). Scheme 
1 calls the function g twice per iteration with stepsize e/2, whereas Scheme 2 calls g only once 
per iteration, but with larger stepsize, e. We investigate both integration schemes to determine 
which is computationally more efficient in terms of time taken per (effectively) independent sample, 
calculated as Time/(Min ESS). The three parameters that may be altered in the RM-HMC algorithm 
are the integration stepsize, e, the number of steps per integration, N\ and the number of inner steps 
per integration, A 2 which update 9. The maximum total distance which a chain may travel in a 
single proposed move is given by eNi, and for any given value of eAi we chose e small enough 
such that the acceptance ratio was above 70% and then adjusted Ai appropriately. Tables 2 and 3 
show the results of the two schemes using a variety of choices for these parameters, which allow us 
to make the following observations. 
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Firstly, it is clear that using more than 1 inner step provides very little improvement in sampling 
efficiency, and may even be detrimental in some instances (we conjecture this is due to overshoot- 
ing, which commonly occurs when the log-density is locally non-quadratic). Secondly, we found 
that sampling generally became more efficient as the maximum total distance travelled by a chain, 
eNi, was increased, i.e. when the chain was able to traverse a distance greater than the width of 
each marginal distribution. Note that Scheme 2 required a smaller step size for eNi = 3, which 
impacted negatively on efficiency. Thirdly, using Scheme 1 it was sometimes possible to perform 
the integration using a larger step size, which meant that a smaller number of integration steps was 
required to move the chain a set distance, compared to Scheme 2 requiring a smaller stepsize and 
larger number of iterations. Therefore, even though Scheme 1 requires two calls to the function g 
per iteration, it appears to be computationally more efficient per independent sample than Scheme 
2. This efficiency is likely also a result of the integration scheme computing the rate of change of 
the metric tensor more accurately, since it calls the function g twice per iteration but with half a 
stepsize. 



4.4. Comparison of RM-HMC 

Following the guidelines given in the previous section, we find that the RM-HMC sampling method 
works very well for a variety of datasets and is fairly robust to the choice of algorithm parameters. 
For comparison with the alternative sampling methods, we chose the settings for RM-HMC based 
on the above analysis. We employed Scheme 1 with 1 inner step, setting e for each dataset equal to 
the smallest stepsize for which the acceptance rate was reasonably high (> 70%), and the number 
of integration steps such that eN x « 3. We repeated the sampling experiments 10 times and aver- 
aged the results, which are shown for each of the datasets in Tables 4 to 8. It is interesting to see 
that MALA generally performs poorly. Whereas all other methods converge within 5000 burn-in 
iterations for all datasets, MALA needs as many as 2 million iterations to converge due to the very 
small stepsize required to achieve an acceptance ratio above 40%. This is particularly the case for 
the Australian Credit and Heart datasets, which exhibit very large differences in scale between the 
largest and smallest marginal standard deviations (see Table 9), resulting in extremely slow explo- 
ration of the target distribution, indeed even after 2 million iterations the Langevin guided chains 
had still not reached their stationary distributions. Clearly some method of scaling the regression 
coefficients would improve the mixing, however this is again unfeasible unless information regard- 
ing the marginal posterior distributions is known in advance. Similarly the standard HMC method 
fails to converge for the Australian Credit dataset, since the stepsize is so small that the number of 
integration steps required becomes computationally impractical to implement. Figure 3 shows the 
trace and autocorrelation plots for 1000 posterior samples using the Heart dataset. The difference 
in autocorrelation is quite striking, both from inspection o f the traces and from examination of the 
autocorrelation plots themselves. The autocorrelation of the RM-HMC samples drop towards zero 
far quicker than for any of the other methods. 

In our simulations, RM-HMC outperforms all of the other methods using every dataset. It is in- 
teresting to note that due to the dense matrix form of the metric tensor and its inverse computational 
cost of RM-HMC on this example will not scale favourably and it can be seen it outperforms by the 
smallest margin on the German Credit dataset, which has largest number of regression coefficients 
(b = 25) and the largest number of data points (N = 1000). A further example based on a stochas- 
tic volatility model is now considered where the metric tensor and its inverse are sparse permitting 
scaling of RM-HMC to very high dimensions. 



Table 4. Australian Credit Dataset, D = 14, N = 690, 1 5 regression coefficients 
- Comparison of sampling methods 



Method 


Time 


ESS (Min, Med, Max) 


s/Min ESS 


Rel. Speed 


Metropolis 


16.5 


(15, 199, 698) 


1.10 


xlO.7 


Aux. Van 


562.9 


(48, 1087, 1457) 


11.73 


xl 


MALA 


No Convergence 


(-, -, -) 






HMC 


No Convergence 


(-, -, -) 






RM-HMC 


82.6 


(4769, 5000, 5000) 


0.0173 


x678 



Table 5. German Credit Dataset, D = 24, N = 1000, 25 regression 
coefficients - Comparison of sampling methods 

Method Time ESS (Min, Med, Max) s/Min ESS Rel. Speed 

Metropolis 449 (10, 81, 604) 449 xl 

Aux. Var. 831.2 (1089,2164,2655) 0.76 x5.9 

MALA 7.7 (3,5,175) 2.57 xl.7 

HMC 3161.6 (2707,4201,5000) 1.17 x3.8 

RM-HMC 892.3 (2264,3084,3717) 0.39 xll.5 



Table 6. Pima Indian Dataset, D = 7, N = 532, 8 regression coeffi- 
cients - Comparison of sampling methods 



Method 


Time 


ESS (Min, Med, Max) 


s/Min ESS 


Rel. Speed 


Metropolis 


6.5 


(14, 35, 181) 


0.46 


X21.7 


Aux. Var. 


468.6 


(1138, 1957, 2397) 


0.41 


X24.3 


MALA 


29.9 


(3, 10, 39) 


9.97 


xl 


HMC 


1499.1 


(3149, 3657, 3941) 


0.48 


X20.8 


RM-HMC 


29.3 


(4981, 5000, 5000) 


0.006 


X1662 



Table 7. Heart Dataset, D = 13, N = 270, 14 regression coefficients - Compari- 



son of sampling methods 



Method 


Time 


ESS (Min, Med, Max) 


s/Min ESS 


Rel. Speed 


Metropolis 


10.4 


(7, 63,516) 


1.49 


x3.7 


Aux. Var. 


215.7 


(722, 1275, 1719) 


0.30 


X18.3 


MALA 


No Convergence 


(-, -, -) 






HMC 


2018 


(368, 2740, 2938) 


5.48 


xl 


RM-HMC 


85.9 


(3371,4031,4519) 


0.025 


x219 



Table 8. Ripley Dataset, D = 2, N = 250, 7 regression coefficients - 



Comparison of sampling methods 



Method 


Time 


ESS (Min, Med, Max) 


s/Min ESS 


Rel. Speed 


Metropolis 


4.1 


(9, 18, 248) 


0.46 


X5.9 


Aux. Var. 


175.9 


(68, 373, 2008) 


2.59 


xl 


MALA 


1.8 


(4, 7, 27) 


0.45 


x5.7 


HMC 


52.8 


(1365, 1596, 1754) 


0.039 


X66.4 


RM-HMC 


58.3 


(3586, 4106, 4522) 


0.016 


xl62 
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Table 9. Summary of standard deviations of the marginal posterior 
distributions for each dataset 



Dataset Smallest Marg. S.D. Largest Marg. S.D. Ratio 



Pima Indian 


0.0043 


0.9646 


225 


Australian Credit 


0.00017 


1.0667 


6404 


German Credit 


0.0038 


1.1492 


303 


Heart 


0.004 


2.9221 


739 


Ripley 


1.2575 


7.556 


6 
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Fig. 3. Trace plots for 1 000 posterior samples with the Heart dataset using Metropolis (top), auxiliary 
variable sampler (second top), standard HMC (second bottom), and RM-HMC (bottom). Autocorre- 
lation plots are also shown for one of its parameters, which may be seen in the trace plots to have a 
mean of around -7. 
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5. RM-HMC for a Stochastic Volatility Model 

A stochastic volatility model (SVM) studied in Liu (2001) is defined with the latent volatilities 
taking the form of an AR(1) process such that y t — e t (3 exp (x t /2) with x t +i — (f>x t + f]t+i where 
e t ~ 7V(0, 1), r\ t ~ 7V(0, a 2 ) and x x ~ Af(0, a 2 /(l - 4> 2 )) having joint likelihood 

P{y, x, P, <j>, a) = Y[ t=1 P(yt\x t , P)p(xi) Y[ t=2 p(x t \x t -i, <f>, cr)7r(/3)7r(0)7r(cr). (16) 

We may conveniently split up the sampling procedure into two steps, which as we shall see 
allows the implementation of RM-HMC in a computationally efficient manner. Firstly we may 
simulate <f>, a, /3 from p(f3, <j>, a\y, x), where the priors, as in Liu (2001), are chosen to be p((3 2 ) cx 
f3~ 2 , a 2 ~ Inv-x 2 (10,0.05) and (0 - l)/2 ~ Beta(20, 1.5). Secondly we may sample the latent 
volatilities by simulating from the conditional p(x|y, f3, <fi, a). We shall consider the use of RM- 
HMC, HMC and MALA for the purpose of sampling both the parameters and latent volatilities. 

5. 1. RM-HMC for SVM Parameters 

We introduce a transformation of the parameters to ensure that they are suitably bounded, a = 
exp(7) and = tanh(a), and we obtain expressions for the transformed priors and log-joint like- 
lihood accordingly. We now require the partial derivatives of the log-joint likelihood with respect 
to the parameters, as well expressions for the metric tensor and its partial derivatives, in order 
to implement the RM-HMC, HMC and MALA methods. All of these quantities may be obtain 
straightforwardly (see Appendix B for details). In particular, the Fisher Information is given by 

" W 

T + 1 20 

2(f> 2 (3-T) + (T — 1) _ 

where T is the number of observations. The metric tensor follows by adding this Fisher Infor- 
mation to the prior precision. Having transformed the priors on (3, a and <p into valid priors on 
j3, 7 and a, we may now use any of these methods to draw samples from the conditional posterior 
p(/3, 7, a|y, x, ), and transform the posterior samples to obtain (3, a = cxp(j) and (f> — tanh(a). 

5.2. RM-HMC for SVM Latent Volatilities 

For all three gradient based methods, we require the gradient of the joint-log likelihood with respect 
to each of the latent volatilities. Defining the vectors u = (x 3 , ■ ■ ■ , xt) j , v = (x 2 , ■ ■ ■ , xt-i) j , 
w = -%(u — (j>w), s = (si, ■ ■ ■ , st) T suc h that = 0.5(1 — yfP~ 2 exp(— a;j)), 8\ = —<j~ 2 (xi — 
4>X2), and 5t = —o~ 2 {x T — 4>Xt-i), we define the vector r = {5\, w T , S 2 ) J and the required 
gradient is V x logp(y, x|/3, 0, a) = V x £ = s - r. 

To devise an RM-HMC sampler for the latent volatilities, x, we also require an expression 
for the metric tensor and its partial derivatives with respect to the latent volatilities. For the data 
likelihood of the model, p(y|x, (3), the Fisher Information is a diagonal matrix with 0.5 for each 
element denoted as I0.5. The latent volatility is an AR(1) process having covariance matrix C with 
elements E{x t + n x t } — 0' n 'cr 2 /(l — 4> 2 ) and as in the previous examples the metric tensor is defined 
as the sum of the Fisher Information and prior precision, G = I 5 + C _1 , conditional on current 
values of a, 0, j3. Now the expression for the covariance matrix is completely dense and is therefore 
computationally expensive to manipulate. Fortunately, this AR(1) process admits a simple analytic 
expression for the precision matrix in the form of a sparse tridiagonal matrix, such that the diagonal 



Riemannian Manifold Hamiltonian Monte Carlo 1 5 



Table 10. 2000 simulated observations with f3 = 0.65, a = 0.15 and 4> = 0.98 - Compari- 
son of sampling the parameters f3, a and <f> after 20,000 posterior samples averaged over 
10 runs 



Method 


Mean Time 


ESS (/3,a,4>) 


S.E. ((3,<r,4>) 


s/(Min ESS) 


Rel. Speed 


MALA 


58.2 


(12.7, 10.0, 33.2) 


(3.2,1.3,5.2) 


5.8 


xl.7 


HMC 


996 


(101,104,212) 


(11.7,8.1,20.4) 


9.9 


xl 


RM-HMC 


368 


(224, 127, 300) 


(19,8.8,31) 


2.9 


x3.4 



Table 11. 2000 simulated observations with f3 = 0.65, a = 0.15 and 4> = 0.98 
- Comparison of sampling the latent volatilities after 20,000 posterior samples 
averaged over 1 runs 



Method 


Mean Time 


ESS (min, median, max) 


s/(Min ESS) 


Rel. Speed 


MALA 


58.2 


(10.3,15.7,26.1) 


5.7 


xl 


HMC 


996 


(278,400,904) 


3.6 


xl.6 


RM-HMC 


368 


(558,877,1698) 


0.66 


x8.6 



elements are equal to (1 + <f> 2 )/ a 2 , with the exception of the first and last diagonal elements which 
are equal to l/cr 2 , and the super and sub diagonal elements are equal to —(f) /a 2 . Thus the metric 
tensor also has a tridiagonal form. For large numbers of observations this sparse structure allows 
great gains in computational efficiency, since the inverse of this tridiagonal metric tensor may be 
computed in 0(n) as opposed to the usual 0(n 3 ). We note that computationally efficient methods 
for manipulating tridiagonal matrices are automatically implemented by the standard routines in 
Matlab. 

We notice that the metric tensor in this case is not a function of x and so the associated par- 
tial derivatives with respect to the latent volatilities are zero. In this case a one step RM-HMC 
integration scheme collapses to 

2 

x = x + yG- 1 V x £ + eVGF T p (17) 

where p ~ N(0, 1) which is a discrete Langevin iteration that is preconditioned by the constant 
matrix G _1 . It is clear that this preconditioning will improve both the mixing and overall ESS, 
see (Lambert and Eilers, 2009) for a recent application of this type of preconditioning in MALA. 
We point out that in the case of RM-HMC the preconditioning matrix emerges naturally from the 
underlying geometric principles of RM-HMC. 

5.3. Experimental Results for Stochastic Volatility Model 

We now compare the computational efficiency of RM-HMC, HMC and MALA for sampling both 
the parameters and the latent variables of the stochastic volatility model as previously defined. 2000 
observations were simulated from the model with the parameter values (3 — 0.65, (7 = 0.15 and <f> = 
0.98 as given in Liu (2001). Using this data, 20000 posterior samples were collected after a burn-in 
period. This sampling procedure was repeated 10 times. The efficiency was compared in terms 
of time normalised ESS, as in the previous section, for the parameters and the latent volatilities. 
MALA was tuned such that the acceptance ratio was between 40% and 60%, and it was necessary 
to use a different tuning for the transient phase than for the stationary phase. HMC was implemented 
using a step size of 0.015 and 200 integration steps per proposal. RM-HMC was implemented using 
a stepsize of 0.5 and 10 integration steps per parameter proposal, and a stepsize of 0.1 and 50 
integration steps per volatility proposal. 
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Fig. 4. Posterior histograms for f3, a and 4> respectively, employing RM-HMC to draw 20,000 samples 
of the parameters and latent volatilities using a simulated dataset consisting of 2000 observations. 
The true values are (3 = 0.65, a = 0.15 and <j> = 0.98. 

RM-HMC gives the best performance both in terms of sampling the parameters and also the 
latent volatilities. In particular it runs faster than HMC, partly because of the computationally 
efficient tridiagonal structure of the metric tensor and partly because RM-HMC follows the con- 
travariant tensor gradient through the parameter space and explores regions of the target density 
more quickly than HMC, refer to Figure 1 and 2 for an illustration of the contrast between HMC 
and RM-HMC sampling of the parameters of this model. MALA exhibits a very poor ESS, however 
the computation time is also extremely small compared to the other two methods and so, based on 
the normalised ESS, MALA does not perform quite as badly as the unnormalised ESS values alone 
might suggest. It should again be noted that in addition to RM-HMC outperforming HMC and 
MALA, RM-HMC requires very little tuning compared to the other methods; unlike MALA it does 
not require different tuning in different parts of the parameter space, and unlike HMC it requires no 
manual setting of a mass matrix. 

We now consider an example where the target density is extremely high dimensional, which is 
encountered when performing inference using spatial data modeled by a log-Gaussian Cox process. 

6. RM-HMC for Log-Gaussian Cox Point Processes 

RM-HMC is further studied using the example of inference in a log-Gaussian Cox point process 
as detailed in (Christensen et al, 2005). This is a particularly useful example in that the target 
density is of high dimension with strong correlations and provides a severe test of MCMC capa- 
bility. The data, model and experimental protocol as described in (Christensen et al, 2005) is 
adopted here. A 64 x 64 grid is overlayed on the area [0, l] 2 with the number of points in each 
grid cell denoted by the random variables Y = {lij} which are assumed conditionally indepen- 
dent, given a latent intensity process A(-) — {A(i,j)}, and are Poisson distributed with means 
mA(i, j) = m exp(Xij), where m = 1/4096. The random variable X = { X{ j } is a Gaussian pro- 
cess with mean E{x} = /il and covariance function ^(i,j),u>j') — cr 2 exp( — <5(z, -j', j, j')/ 64/3), 
where S(i, j') = y/ (i — i') 2 + {j — j') 2 - The joint density is 

n64 j 
exp{yijXij - mexp(a; i ,)}exp(-(x-/il) S (x-/il)/2) (18) 

Denoting C = logp(y|x, fi, a, /3), e = {m exp(xjj)} then V x £ = y — e. The Fisher Information 
matrix is E = diag(e), with the addition of the prior precision matrix the metric tensor and its 
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Fig. 5. Trace plots of the log joint-likelihood for the first 5000 samples of the latent variables of a log- 
Gaussian Cox process. The left hand plot shows the convergence of the RM-HMC scheme which 
is able to directly sample the latent variables x without the need for ad-hoc reparameterisations and 
pilot runs for fine-tuning. The middle plot shows the log joint-likelihood for samples drawn by MALA 
using a reparameterisation of the latent variables. The scaling was carefully tuned to allow traversal 
of the parameter space to the posterior mode. The right hand plot shows the trace of the MALA 
sampler tuned for optimally sampling from the posterior mode. We note that the algorithm is now 
unable to traverse the parameter space when initialised away from this mode. Such fine-tuning and 
reparameterisation is frequently necessary when employing MALA and the HMC sampler. 



partial derivatives follow as G(x) = E + and 9G(x) / dx^j — E' where E' is a zero matrix 
with m exp(xi.j ) as the (64 x (i — 1) + j) th element on the diagonal. 

Noting that the metric tensor has dimension 4096 x 4096 the 0(N 3 ) operations required in the 
RM-HMC scheme are clearly going to be computationally costly. However, it should be noted that 
in previous studies of this Log-Gaussian Cox process, (Christensen et al., 2005), a transformation 
of the latent Gaussian field is necessary based on the Cholesky decomposition of + diag(x), 
which will therefore also scale as 0(N 3 ). 



6. 1. Experimental Results for Log-Gaussian Cox Processes 

Following the example given by Christensen et al. (2005), we fix the parameters (3 = 1/33, 
a 2 = 1.91 and \i = log(126) — <r 2 /2. We generate a latent Gaussian field, x, from the Gaussian 
process and use these values to generate count data y from the latent intensity process A. Given the 
generated data and the fixed parameters, we infer x using RM-HMC and the Langevin method as in 
Christensen et al. (2005). 

The Langevin based method (MALA) is particularly sensitive to the choice of scaling, as is 
HMC, and very often a reparameterisation of the target density is required for these methods to be 
effective. Indeed this is seen to be the case with this particular example, where MALA is unable 
to sample x directly. We therefore follow Christensen et al. (2005) and employ the transformation 
X = fil + Lr, where L is obtained by Cholesky factorisation such that {£ — diag(x)} -1 = 
LL T . Even after this re-parameterisation, it is still necessary to carefully tune the scaling factor 
for this method to work at all. This challenging aspect of employing MALA has been investigated 
in detail by Christensen et al. (2005) who characterise the problem very well, advise great care 
in its implementation, but ultimately are unable to offer any panacea. In contrast to the necessary 
transformation and fine-tuning required by MALA, RM-HMC allows us to directly sample the latent 
variables x without reparameterising the target density. Additionally, no manual tuning via pilot runs 
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is required since the stepsize of the symplectic integrator may be adjusted automatically based on 
the acceptance rate. 

Figure 5 shows the traces of the log joint-likelihood for both methods using the starting position 
Xij = /Lt for i, j — 1, . . . , 64. Note that for MALA these starting positions must be transformed into 
corresponding values for T. The RM-HMC sampler quickly converges to the true mode after very 
minimal automatic tuning of the integration stepsize based on the acceptance rate. MALA converges 
in a similar number of iterations, but only for a suitable choice of scaling factor. The middle plot 
in Figure 5 shows convergence when the scaling factor is carefully tuned for the transient phase of 
the Markov chain, however the right hand plot demonstrates how it fails to converge at all given 
a scaling factor which is tuned for stationarity. In this example the RM-HMC method required 10 
seconds per sample compared to the 6 seconds needed by MALA, however this does not take into 
account the often considerable time and effort required to tune MALA. The algorithms were run on 
a single AMD Opteron processor with 8GB of memory and were coded in Matlab. 

Inferring the latent field of a log-Gaussian Cox process with a finely grained discretisation is 
clearly a very challenging problem due to the high dimensionality and strong spatial correlations 
present between the latent variables. The major challenges associated with employing MALA are 
firstly finding a suitable reparameterisation of the target density, and secondly making a suitable 
choice for the scaling factor according to whether the Markov chain is in a transient or station- 
ary regime. In contrast, RM-HMC does not exhibit such extreme technical difficulties. We have 
demonstrated that RM-HMC is able to sample directly from the original target distribution with 
minimal automatic tuning and effort, albeit with a slightly increased computational cost. We will 
now turn our attention to the very topical application of statistical inference to nonlinear differential 
equations. 

7. RM-HMC for Nonlinear Differential Equation Models 

An important class of problems recently gaining attention is the statistical analysis of uncertainty in 
dynamical systems defined by a system of nonlinear differential equations (Ramsay et ai, 2007; 
Calderhead et ai, 2009; Vyshemirsky and Girolami, 2008). A dynamical system may be de- 
scribed by a collection of N nonlinear ordinary differential equations and model parameters 
which define a functional relationship between the process state, x(t), and its time derivative 
such that x(t) = f(x, 0,t). A sequence of process observations, y(t), are usually contami- 
nated with some measurement error, which is modeled as y(t) = x(t) + e(t), where e(t) de- 
fines an appropriate multivariate noise process, e.g. a zero-mean Gaussian with variance o\ for 
each of the N states. If observations are made at T distinct time points, the N x T matrices 
summarise the overall observed system as Y = X + E. In order to obtain values for X, the 
system of ODEs must be solved, so that in the case of an initial value problem X(#,x ) de- 
notes the solution of the system of equations at the specified time points for the parameters 6 
and initial conditions xo. The posterior density follows by employing appropriate priors such that 
p(0,x o ,<7|Y) oc 7r(0)7r(xo)7r(<T)n„^(Y n ,|X(0,x o )„,.,Ia2). The desired marginal p(9\Y) 
can be obtained from this joint posterior. 

Various sampling schemes can be devised to sample from the joint posterior. However, regard- 
less of the sampling method, each proposal requires the specific solution of the system of differen- 
tial equations. This is the main computational bottleneck in running an MCMC scheme for models 
based on differential equations. The computational complexity of numerically solving such a sys- 
tem cannot be easily quantified since it depends on many factors such as the type of model and its 
stiffness, which in turn depends on the specific parameter values used. In Calderhead et al. (2009) 
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an MCMC methodology was proposed, similar in spirit to the work of Ramsay et al. (2007), which 
sidesteps the issue of solving the system of equations within an MCMC routine by introducing aux- 
iliary Gaussian Process (GP) functions (Williams and Rasmussen, 2006) to define distributions over 
the evolution of each state and their associated time derivatives. 

7. 1. Gaussian Process Regression on State Variables 

Let us first introduce a statistical model based on GP regression to describe the time evolution of 
the observed dynamical system. For notational convenience assume independent GP priors on the 
state variables such thatp(X„.. \<p n ) = A/"(X„.. |0, C P J, where C Vn denotes the matrix of covari- 
ance function values with hyperparameters ip n . With noise e„ ~ Af(0, o^I T ), the state posterior, 
p(X ni .|Y„ ; .,cr n ,<£„) follows as Af(X ni .\fi n , S n ) , where fi n = C Vn (C Vn + a£l) _1 Y n) . and 
S„ = a^ l C lPrl (C, Pn + cr^I) -1 . Given priors 7r((7 n ) and n(ip n ) over the hyper-parameters their 
posterior is p((f n , a n \Y n _.) cx 7r(<7„)7r(<£>„).A/'(Y„ ; . |0, (j 2 n l + C Vn ) . The conditional prior for the 
state-derivatives follows as p(X ni . |X n> . , <p n ) = A/"(X ni . |m n , K„), where the mean and covariance 
are given by m„ = 'C Vn (C V J- 1 X„,. andK„ = C Vn - 'C Vn (C v „) _1 C Vb with C Vn denoting 
the auto-covariance for each state-derivative, and C^ n and 'C Vn denoting the cross-covariances 
between the state and its derivative (Williams and Rasmussen, 2006). The GP specifies a jointly 
Gaussian distribution over the regression function modeling the system states and their time deriva- 
tives. 

In Calderhead et al. (2009) a second statistical model of the state derivatives is obtained by 
assuming normal errors with variance 7„ between the state derivatives and the value of the vector 
field obtained when plugging in the GP posterior samples of the state values, p(X n ,. |X, 6, 7) = 
A/"(X„,.|f„(X, 8, t), 7„I). Both the GP-based regression model, p(X„ ; .|X ni ., <p n ), and the model 
representing the state derivatives in terms of the induced vector field, p(X nj . |X, 0, 7), are combined 
in product form to model p(X„ ; . |X, 6, 7, cp, er). By analytically marginalising the state derivatives 
the following sampling scheme provides samples fromp(#, X, (p, cr, j\Y) (see Appendix C), 

p(¥>„,tr„|Y ni .) cx ■K(a n )-K{<p n )N{Y n .\0,all + C Vn ) V n=l---N (19) 
p(X nr \Y nr ,a n ,ip n ) = 7V(X„,.|/x„,S n ) Vn = l---JV (20) 
p(0,~,\X,cp,CT) cx cxp(-;7(X,0,7,<p,<T)/2)7r(7)7r(0) (21) 

where U(X, 6, 7, <p, cr) = J]^ r = i(fn-m„) T (K„+l7 n ) _:L (f n -m„), with f„ denoting f„(X, 6, t). 
It is clear that Metropolis sampling is required for (19) and (21). Given the complexity of the in- 
duced likelihood function for nonlinear differential equations with respect to the structural param- 
eters 9, see Ramsay et al. (2007) and Calderhead et al. (2009), an RM-HMC sampling scheme is 
now considered for both (19) and (21). 

7.2. Metric Tensor and Derivatives for Systems of Nonlinear Differential Equations 

To implement RM-HMC for (21) we require the metric tensor and its derivatives with respect to the 
target parameters. In Appendix D it is shown that an approximate form for the Fisher Information 
for the parameters 8 and hence the metric tensor is G(6) ~ X^=i F ra (K ra + l7„) _1 F^ and the 
required elements of the derivative of the above metric tensor follow 

dgd.d' _ \ - s .s> ( <9 2 f«, s di n ,s' di n , s ' d 2 i n , s \ 

M ~ 2^ K n [ de . d6d d9d , + Q 9d d id d ,J ( ' 




Fig. 6. Output for species V (left) and species R (right) of the Fitzhugh Nagumo model with param- 
eters a = 0.2, b = 0.2, c = 3. An example noisy dataset is shown by the red points. 



where k^ s denotes the s, s' element of (K„ + l7 n ) _1 , and all the first and second order partial 
derivatives of the vector field can be obtained analytically from the system of differential equations. 

From details given in the appendix, the metric for each of the error variance terms -v/tTT is the 
Fisher Information .9(^/7^1) = 2^/=^ ~ 27„trace (H^H^ 1 ) where H„ = (K„ + l7„), and the 
corresponding derivative dg(^/%)/d^/%. = 2trace (2 v /t^H^ 1 H^ 1 (I - 7„H~ 1 )). 

Finally, Appendix E provides the details for the RM-HMC procedure for (19). We now have 
everything required to implement an RM-HMC sampling scheme for dynamical system models 
defined by systems of nonlinear differential equations. 



7.3. Experimental Results for Nonlinear Differential Equations 

We now present results comparing the sampling efficiency for the parameters of the Fitzhugh 
Nagumo differential equations (Ramsay et ah, 2007), 

/ V 3 \ (V-a + bR\ 
V = c(V- — + R\, R = -l J (23) 

We may follow the sampling scheme given by equations (19) to (21) to obtain samples from the joint 
posteriorp(0, X, <p, cr, 7|Y), and so in this example Xi . = V and X2,. = R. We note that we can 
sample exactly from equation (20) and so we investigate the efficiency of sampling methods for the 
GP hyperparameters, the model mismatch parameters and the structural model parameters. Indeed, 
we split the sampling step given by equation (21) into two steps for ease of computation, firstly sam- 
pling p(~f\8, X, tp, cr), then sampling p(8\~f, X, ip, cr). We employ a Metropolis-Hastings scheme, 
MALA, HMC and RM-HMC, using Scheme 1, as described for the logistic regression simulations. 
We again compare the simulations by calculating the effective sample size (ESS) normalised by the 
computational time required to produce the samples. 



7.3.1. Sampling GP Hypeqjarameters 

We first examine the sampling of the hyperparameters for the GP (equation (19)). We investigate the 
performance of each method with data generated from the Fitzhugh Nagumo ODE model. We used 
100 data points generated between t = and t = 20 with the model parameters a = 0.2, b = 0.2, 
c = 3. Gaussian distributed noise was then added to the data, with standard deviation equal to 10% 
of the standard deviation of each species respectively, see Figure 6. 

All of the methods converged to the three dimensional target distribution within 2000 burn in 
iterations (according to the acceptance ratios and Gelman's R statistic), after which 2000 posterior 
samples were collected. We calculated the ESS for each parameter and used the minimum value 
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Table 12. Fitzhugh Nagumo Species V: Summary of results for 100 runs of GP regression for 
2000 posterior samples 



Sampling Method 


Time (s) 


Mean ESS (<p% , <p% , a v ) 


Time/(Min mean ESS) 


Rel. Speed 


Metropolis 


17.8 


214, 194, 282 


0.091 


X32.5 


MALA 


16.1 


5.4, 5.9, 50.3 


2.960 


xl 


HMC 


56.2+17.8 


797, 1311, 1425 


0.093 


X31.8 


RM-HMC 


101.8 


1987, 1581, 1821 


0.064 


X46.3 



for comparison, calculating the time per effectively independent sample. 100 simulations were run 
for each method, and for each simulation new experimental data was generated, thus the marginal 
distributions were slightly different in each run. All methods were implemented in the interpreted 
language Matlab for consistency of comparison. 

In order to implement our RM-HMC scheme for sampling the GP hyperparameters, we must 
compute the metric tensor, given by the Fisher Information matrix, as well as its partial derivatives 
with respect to each of the hyperparameters of the GP. These can be derived straightforwardly as a 
function of the GP covariance function and its first and second partial derivatives (see Appendix E 
for full details). In this example due to the smoothness of the dynamics induced by the system of 
equations we employ a stationary radial basis covariance function, although other covariance func- 
tions may also be employed to better capture the characteristics of the specific data being modeled. 

For setting the masses in the HMC scheme, we employed estimates using the average marginal 
variances obtained from running the Metropolis-Hastings scheme and chose the stepsize and number 
of integration steps such that the acceptance rate was greater than 70%. The setting of the masses 
using such exploratory Metropolis runs was necessary to achieve a reasonable acceptance rate, since 
the marginal distributions of the hyperparameters were of different orders of magnitude. In our 
results we therefore add the average time taken for a Metropolis run to the average time taken for 
an HMC run, although in practice extra time is required to implement this necessary tuning. 

The results of our simulations are given in Tables 12 and 13, and the posterior samples from 
a typical run for each method are given in Figure 7. RM-HMC samples about twice as effectively 
as the M-H scheme in terms of normalised ESS, and the difference in the correlation between 
consecutive samples is clearly visible in the trace plots. We see that MALA fares very badly when 
sampling the hyperparameters, and this is due to the different orders of magnitude of the marginal 
distribution in each dimension. The standard implementation (Roberts and Rosenthal, 1998; Roberts 
and Stramer, 2003) employs a symmetric proposal distribution, thus when the algorithm adapts its 
stepsize (according to the acceptance rate) it is limited by the dimension with the smallest marginal 
variance. It therefore generally samples one parameter much more efficiently than the others, as may 
be seen from the ESS values reported in Tables 12 and 13. We could of course tune MALA by pre- 
multiplying by a mass matrix, and so the approach becomes equivalent to an HMC scheme with just 
one integration step when proposing new hyperparameters. The difference between the two methods 
then is that HMC suppresses the random walk aspect of MALA, which is generally desirable for 
improving the speed of exploration of an unknown probability distribution. Of these two methods, 
we therefore consider only HMC in the following sections, given that it is a generalisation of the 
Langevin approach. 

7.3.2. Full RM-HMC Sampling Scheme 

We now also require the first and second partial derivatives of the Fitzhugh Nagumo equations in 
order to calculate the metric tensor (see Appendix D) for employing RM-HMC to sample from the 




Fig. 7. Trace and corresponding autocorrelation plots for 2000 posterior samples of the 3 RBF 
hyperparameters {ipX , and a v , top to bottom respectively) for species V of the FHN model using 
Metropolis (top), MALA (second top), standard HMC (second bottom), and RM-HMC (bottom) 



Riemannian Manifold Hamiltonian Monte Carlo 23 
Table 13. Fitzhugh Nagumo Species R: Summary of results for 100 runs of GP regression for 



2000 posterior samples 



Sampling Method 


Time (s) 


Mean ESS (<fi , ip? , <r H ) 


Time/(Min mean ESS) 


Rel. Speed 


Metropolis 


20.7 


128, 146, 251 


0.164 


X18.2 


MALA 


14.6 


4.9, 5.2, 79.2 


2.980 


xl 


HMC 


77.5 + 20.7 


343, 559, 942 


0.286 


xl0.4 


RM-HMC 


126.5 


1767, 1692, 1941 


0.075 


x39.7 



distribution given in equation (21), 

dV _ dV _ dV _ ( ( V -a + bR \ 

da db dc \ 3 J 1 da c db c ' dc \ c 2 J 

All of the second derivatives of V with respect to the model parameters are equal to zero, and the 
five non-zero second partial derivatives of R are as follows, 

d 2 R _j_ cPR_ _ R cPR_ 1 d 2 R _ R d 2 R _ 2 f -V + a-bR \ 
dadc c 2 ' dbdc c 2 ' dcda c 2 ' dcdb c 2 ' dc 2 \ c 3 J 

We now compare the performance of M-H, HMC and RM-HMC. We note that the performance 
of the HMC scheme was very sensitive to the tuning of its mass parameters, and the number and size 
of the integration steps required for a reasonable acceptance rate were considerably higher and lower 
respectively than those needed when using RM-HMC. We see that the extra computational expense 
incurred in computing the metric tensor and its derivatives is offset by the fact that fewer integration 
steps are needed for each new parameter proposal since larger stepsizes may be employed. 

Additionally, the tuned HMC scheme sometimes took much longer to converge than RM-HMC, 
particularly if the initial model parameter values were far from the true values, since the integra- 
tion parameters were tuned to optimise sampling from the true posterior and different integration 
parameters are often required to sample efficiently in different regions of parameter space. Figure 
(8) shows the trace of a Markov chain generated by each of the methods with the initial parameter 
values a = 6, b = 6, c = 6. The difference in the rate of convergence is striking, with RM-HMC 
converging the quickest. While the M-H scheme converges slower than RM-HMC, it converges 
much quicker than HMC with tuned masses, although it does then require another few hundred iter- 
ations to adapt to the posterior mode until reasonable acceptance rates are achieved. Once the tuned 
HMC scheme does finally reach the posterior mode it visibly samples much more effectively than 
Metropolis. 

On the other hand, if we consider an HMC scheme with the same size and number of integration 
steps but with unit masses, instead of carefully tuned masses, we see the Markov chain traverse the 
parameter space much more quickly but then sample very poorly from the posterior mode, since the 
stepsize is too large relative to the width of the mode. The transient and stationary phases of the 
HMC algorithm clearly require different level of scaling of the masses, see Christensen et al. (2005) 
for a theoretical analysis related to MALA, whereas RM-HMC overcomes this issue automatically. 

In summary, the masses in HMC may be set approximately equal to the marginal variance 
of each parameter to ensure efficient sampling from the posterior. However, if these masses are 
very small relative to the size of space being explored, then the chain cannot travel far in one 
iteration resulting in slow convergence to the posterior mode. In this example we again see the clear 
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Fig. 8. Trace of the first 1000 model parameter samples for Fitzhugh Nagumo with Metropolis (top), 
HMC with unit masses (second top), HMC with tuned masses (second bottom) and RM-HMC (bot- 
tom) 



Table 14. Fitzhugh Nagumo: Summary of results for 100 runs of the full sampling 



scheme with 2000 posterior samples 



Sampling 


Burn-in 


Posterior 


Mean ESS 


Total Time/ 


Relative 


Method 


Time (s) 


Time (s) 


(a, b, c) 


(Min mean ESS) 


Speed 


Metropolis 


201.8 


165.7 


397,417, 311 


1.181 


xl.04 


HMC 


460 


607 


1399, 865, 1254 


1.234 


xl 


RM-HMC 


52.5 


252.4 


1978, 1931, 1824 


0.317 


X3.89 



advantage of the RM-HMC scheme; it is self-tuning in that the metric tensor automatically adapts 
the mass matrix to the local topology of the parameter space, allowing it to take bigger steps through 
parameter space when required. This is often the case during the burn in period, particularly for this 
type of inference problem where, unlike in the logistic regression example, a nonlinear likelihood is 
induced by the nonlinearities of the differential equation model. The results of our simulations are 
shown in Table (14). Although Metropolis is quickest at drawing 2000 posterior samples, its mean 
ESS is around five times smaller than that of RM-HMC and, importantly, takes much longer than 
RM-HMC to converge to the target distribution. As a result, Metropolis requires around 3 times 
longer than RM-HMC to produce an effectively independent sample. The HMC scheme fares the 
worst of the three methods. As we have discussed, this is due to the rather sensitive effect that tuning 
the masses has on the ESS. In addition, since the masses are fixed throughout the simulation, much 
smaller integration stepsizes are required, and consequently a larger number of integration steps. 
This HMC approach then becomes computationally very intensive for this problem and is seen to 
be less efficient than Metropolis as a direct result. The fast convergence of RM-HMC combined 
with its high ESS scores results in this approach being the most efficient of the three methods, even 
normalising against the computational effort required. 
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8. Conclusions and Discussion 

In this paper a Riemannian Manifold Hamiltonian Monte Carlo sampler has been developed in 
an attempt to improve upon existing MCMC methodology when sampling from target densities 
that may be of high dimension and exhibit strong correlations. It is argued that the method is fully 
automated in terms of tuning the overall proposal mechanism to accommodate target densities which 
may exhibit strong correlations, widely varying scales in each dimension, and significant changes 
in the geometry of the manifold between the transitional and stationary phases of the Markov chain. 

By exploiting the natural Riemannian structure of the parameter space of statistical models the 
proposed method can be seen to be a generalisation of both HMC and MALA methods and as such 
overcomes the oftentimes complex manual tuning required of both methods. In high dimensional 
problems such as inferring the 4096 dimensional latent Gaussian field MALA and HMC fail com- 
pletely due to the high levels of spatial correlation in the latent field and can only proceed after a 
transformation is used to break those correlations. In contrast RM-HMC proceeds without the need 
for such a transformation or indeed phase specific tuning. 

A novel semi-explicit symplectic integrator is developed for the non-seperable Hamiltonian that 
emerges due to the appearance of the metric tensor in the log joint-likelihood. Of course the Gen- 
eralised Leapfrog method for non-separable Hamiltonians is available see e.g. Leimkuhler and 
Reich (2004) however the updates for parameters and auxiliary variables (momentum) are defined 
fully implicitly requiring further nonlinear solutions for the iterations. The Sundman transformation 
could be considered but it is unclear how this would be at all practical for the general methodology 
which was being developed. What is important is that the second order convergence of the New- 
ton step has, empirically, been found to require a single step in all of the examples that have been 
considered in this paper when simulating paths across the manifold. The overall RM-HMC method 
employing this symplectic integrator has been shown to provide highly efficient convergence and 
exploration of the target density for the range of models considered. 

Clearly there are two main overheads when employing RM-HMC, the development of the ana- 
lytical expressions for the metric tensor and the associated derivatives as well as the 0(N 3 ) scaling 
of solving the linear systems when updating the parameter vectors i.e. inverting the metric tensor. 
In all but the nonlinear differential equation example, exact analytical expressions for the Fisher In- 
formation could be obtained. It remains to be seen what other classes of statistical models may have 
this same issue, nevertheless even with this approximation RM-HMC remains superior to HMC and 
MALA in time normalised sampling efficiency in this example. 

The issue of the 0(N 3 ) scaling is something which deserves further consideration. In some 
statistical models there is a natural sparsity in the metric tensor, the SVM is a case in point where 
due to this structure RM-HMC was computationally more efficient than HMC. In other models this 
is not the case for example the logistic regression model and the Log-Gaussian Cox model. It should 
be noted that adaptive MCMC methods, see e.g. Andrieu and Thorns (2008), also incur the same 
level of cubic scaling. At the very high dimensional end of the scale a decorrelating transformation 
is required for MALA and HMC and this will also incur an 0(N 3 ) scaling however further work to 
characterise the incurred computational costs at the intermediate dimensionality regime will be of 
value. 

In summary the RM-HMC method provides a novel MCMC algorithm whose performance has 
been assessed on a diverse range of statistical models and in all cases has been shown to be superior 
to similar MCMC methods. We finally note, as has been highlighted previously, (Neal, 1993a, 
1996; Liu, 2001), that RM-HMC can be embedded within a population MCMC procedure when the 
posterior has distinct separated modes, see e.g. Calderhead et dl. (2009) . 
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A. Symplectic Integrator for Non-Seperable Hamiltonians 

To develop the Riemannian Manifold Hamiltonian Monte Carlo (RM-HMC) method an non-implicit 
symplectic integrator is required for non-separable Hamiltonians of the form 

H(e,p) = q>(0) + lp J G(8)- 1 p (24) 
A. 1. One Dimensional Case 

To clearly illustrate the manner in which the symplectic integrator is derived the simplest case of 
univariate random variables is considered where (f)(0) = and the Hamiltonian is 

-km (25) 

with associated equations of motion 

d9 8H p dp 8H ( l dgjO)-^ 2 2 

Introducing a general dynamical variable W(9,p) whose time evolution is determined by the 
above equations of motion then the following can be considered as a Poisson bracket operator equa- 
tion for W(9,p) (Hairer et al, 2002) 

with solution 

W t+e = exp(e(T + l>)) W t (28) 

Symplectic integrators are derived by approximating the short time evolution operator in product 
form such that 

cxp (e (f + V - )) w II CX P ( Uef ) ex P ( v ^) (29) 

i=i 



Riemannian Manifold Hamiltonian Monte Carlo 29 

provided that the effect of each component operator can be computed exactly and coefficients t i , Vi 
are determined by the necessary order. For example 

"(*)• - ( 1 + H" 2 |) + K"< , '"' 2 |;) 2 + 5!(°< <, "' 2 C 3 



exp 



dp 



P 



1 - ea{6)p 



where the latter corresponds to the integration of dp/dt while holding 6 constant. Now considering 
the exp(eT) operator, it is straightforward to see that exp(eT)p = p. As T = pg(6)~ 1 -§g, the 
polynomial expansion will act repeatedly on g(9)^ 1 and an analytic closed form solution will not 
emerge. This issue can be circumvented however by introducing a change of variable from 9 i— > q(9) 
such that g(9)d9 = dq =>■ q = J g(9)d9. The T operator now simplifies to T = p-j^ and one 
has exactly 

exp(eT)g = ( \ + € p^- + \ ( ep-^] + ■ • ■] q = q + ep (30) 



dq 2 \ dq 

We are now in a position to define a second order symplectic integrator by employing the following 
factorisation, 



exp (e (t + v)) « cxp (eV /2j exp (eTj exp (eV /2) (31) 

: expre 
pro vie 

(6 t+e ,Pt+e)- 



Gathering the expressions for each of the two operators acting on both 6 and p yields the following 
algorithm to provide the mapping (9 ,po) * (6,p), i.e in terms of time evolution (6a Pt) l— * 



Po 

Pi = -, 1 — ( 32 ) 

1 - ^eaiOojpo 

q(6) = q{9 ) + e Pl (33) 
1 - ±ea(9)pi 

To obtain 6 from equation (33), one can regard it as the root of f(6) = q(8) — q(6o) — ep\ = 
and solve for it via a Newton iteration, 6 = 9 n — f(6o)/f'(6o) = @o + <?(#o) _1 ef>i> repeating with 
9 9 as necessary for convergence. Whilst this step of the integrator is defined iteratively it should 
be stressed that the overall symplectic integrator that is presented in this paper is, computationally, 
an enormous improvement to fully implicit methods such as the Generalised Leapfrog, see e.g. 
Leimkuhler and Reich (2004). 

Prior to considering the full multivariate case a generalisation of the operator V is presented as 

V = (a(9)p 2 + (3(6)P + 1(0)) ^ (35) 
Now since the following equalities hold, 

2 9 \ _ P _ / ,o„ ^ "\ „ o\„ _ ( & 



exp ( tap —J p = - _ exp ^e/3p— J p = exp(e/3)p, exp ^7— ) p = p + 67 
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a second order factorisation of exp(eV') follows as 



.1 a\ (i 2 d\ ( d\ (i 2 d\ (id 

exp(eV ) — exp -z^Jtt- 1 exp —tap tt- 1 ex P e PP-^~ 1 ex P 7; ea P tt - 1 ex P o e 7~ 



2 dp J \ 2 dp / \ <9p / \ 2 9p / \2 dp J 
from which the following general algorithm emerges 

Pi = Po + n/2 (36) 

P2 = jt: (37) 

1 — eapi/2 

P3 = exp(e/3)p 2 (38) 
Pi l-eap 3 /2 

p = PA + ei/2 (40) 

which will be referred to as p = g(po, e, a, (3, 7). Now the full multi-dimensional case will be 
considered in developing the eventual symplectic integrator required for RM-HMC. 



A.2. Multi-Dimensional Case 

In this case the Hamiltonian equations of motion follow as 

f -£-<«<•>-"-).■ m 

where ip k (0) = -d<j>{6)/d6 k , A k {0) = -^dG(e)- 1 /d6 k , and (-) fc denotes the k th element of 
the corresponding vector. The Poisson operators are then defined as 

f = ^ e) ^ w k v = M + pTAfe w p ) w k 55 p + A 

As in the one dimensional case we define Qi(9) — J g^ (9)d6j, where denotes the ij'th ele- 
ment of the metric tensor G(0) then it follows that d/d6 k = {dQi/d9 k ){d/dQi) = J2i 9ki(0)d/dQ h 
thus simplifying the T operator to T = p k d/dQ k . The metric tensor is symmetric so 

exp(ef )Q = (1 + ep • O/OQ + i(ep • d/OQ) 2 + )Q = Q + ep (42) 

where • denotes the dot product. The update O — > follows via Q(0) = Q(0o) + ep. Note 
this system of equations can be solved via a Newton iteration = Q + (dQ(9 Q )/d6 )~ 1 ep = 
0o + eG^ 1 (0 n )p as in the uni-dimensional case. The operator cxp(eF) is simply a forward time 
shift of the momentum p and as such p = p + e(p(0o). 



A.2.1. Implementing the exp(eF) Operator 

Splitting the operator exp(eA) such that A = J2iLi where D is the dimension of the vector 
space and each A, = p J A l (0)p-^- then to second order 
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D-l 1 

cxp(ei) = Y[ exp(ei 2 /2)cxp(ei D ) J[ exp(ei J /2) + G(e 3 ) (43) 

i=l j=D-l 

where the factorization is left-right symmetric guaranteeing time-reversibility. Now p T A fc (0)p = 
a kPk + (3k(p~k)Pk + 7fe(p-fe), where p ^ denotes the vector with the fc'th element removed and 

a k = A k kk , P k {P-k) = 2j2^ kPl , 7fc(P-fc) = I] ^ijPiPj 

i^k ij^k 

allows each element of the operator to be written as 

cxp(ei fc ) = exp {a k p\ + Pk(P-k)Pk + 7fe(P-fe)) 

The overall factorisation for exp(eA) amounts to repeated calls to p = g(p* ,e,a,f3, 7), (equations 
starting from 36), where p* is the current momentum value. These calls must follow the sequential 
order of equation (43), such that the vector update p* p is denoted by the vector function 
p = g(6, p*, e), which is defined as 

Pk = g(pt,e/2,a k ,/3 k {[pi:k-i P*fe+i:u]),7d([Pi:fc-i P*k+i: D })) for fc = l to {D-l) 
Pd = g{p* D ,e,a D ,(3 D {pi :D -i),^ D {p 1 .. D - 1 )) 

Pk = g{pk,e/2,a k ,0k{[pi-.k-i Pfe+i:u]),7d([pi:fc-i Pk+i-.D])) for k = (D - 1) to 1 
where p is an intermediate variable used to implement the sequential updating of the momentum. 
A.2.2. Overall Symplectic Integrator in Multi-Dimensional Case 

Finally the overall time-reversible symplectic evolution operator can be obtained to second-order 
by the following splitting of the non-separable Hamiltonian 

exp(e/ , /2) exp(ei/2) exp(ef) exp(ei/2) exp(eF/2) 

which yields the overall operator (p , Oq) — > (p, 0) where p , Oo are the current momentum and 
parameter values respectively. This is now denoted as the vector function (p, 6) = f (po, 60, e) 



Pi = Po + (e/2)v?(0 o ) 

P2 = g(0 o ,Pi,e/2) 

e = e + eG- 1 {e ) P2 

p 3 - g(0,p 2 ,e/2) 

p = p 3 + (e/2)tp(0) 



where the third step may be iterated if required. This is referred to as Scheme 1 . An alternative split 
is possible yielding the alternative Scheme 2 given below. 
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Pi = Po + (e/2M<? ) 

P2 = g(0i,Pi,e) 
= e l + {e/2)G- 1 (6 l )p 2 
p = p 2 + (e/2)<p(0) 

where the second and fourth step may be iterated if required. The necessary time-reversible volume 
preserving integrator is now available to fully define the RM-HMC algorithm. 



B. Derivation of Stochastic Volatility Equations 

The joint likelihood can be written as 

T T 



P{y, x, 0, (7, <j>) ^Y[p(yt\x t ,P)p(x 1 )Y[p(x t \x t ^i,a,(j))T:{P)n(a)Tr(<j)) (44) 



1=1 



where, following Liu (2001), we use the priors p(f3 2 ) oc /3~ 2 , a 2 ~ Inv-x 2 (10, 0.05) and (</> — 
1) /2 <~ Beta(20, 1.5). We now introduce a transformation of the parameters to ensure they have the 
appropriate support when sampling i.e. a = cxp(7) and = tanh(a), and the partial derivatives 
of log-joint likelihood with respect to transformed parameters are as follows 



8L 
8L 

dL 

da 




(45) 

(46) 

(47) 
(48) 



If we want to sample the parameters using RM-HMC, then we also need expressions for the metric 
tensor and its partial derivatives with respect to [3, 7 and a. We can obtain the following expressions 
for the individual components of the metric tensor 



E 



dL dL 
~d[3~di3 
(8L8L 
I dj da 
(dLdL 
\df3 d~/ 



1 



2T E [dLdL 

/3 2 ' \ 97 c*7 



da da 



eI—— 

\a 7 d/3 



E dLdL 

df3 da 



E {— — \ 

dad/3 J 



(49) 
(50) 
(51) 



Thus the metric tensor may be written as 



G 



2T 
W 








T+l 

2<j> 
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f(3-T) + (T-l) 
and the partial derivatives of the metric tensor follow as 



dG 

00 



4T 
'W 
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2(1 -4> 2 ) 

20(l-0 2 )(3-T) 



C. Derivation of Conditional Density for ODE parameters 

A model of the density p(X|X, 0, 7, p, cr) is required which will assign high probability mass 
to values of the state derivatives when the ODE parameters and the corresponding GP param- 
eters are consistent. One way this can be achieved is to ensure that the GP regression model, 
A/"(X„ r |m„, K„), and the model of structural mismatch, A/"(X„ ; . |f„(X, 0, t), Ij n ), both assign 
similar probability mass to state-derivative values having consistent regression and structural (ODE) 
parameters. This requirement can be satisfied by making the modeling choice that, p(X|X, 0, 7, <£, <r), 
be represented as a product of Gaussians such that 



,*, x ft _ p(X,X|g, 7 ,y,<r) _ Un^fr*, |m n , K n )Af(X n , |f n (X, 0, t), I 7w ) 

I ' 'T'^'^ p{X\0, 1 ,<p,cr) n„AA(m„|f„(X,0,t),K„+I 7n ) 

By equating both denominators of the above expression to obtain the marginal density for the 
state values after the state derivatives have been marginalised and denoting U (X, 0, 7, ip, cr) = 
X^=i( f ™ _ m„) T (K„ + l7„) _1 (f„ - m„) then it is clear that 

p(X|0,7,¥>,cr) = 1 rcxp ( ~U{X, 0, 7, <p, cr] 

where the normalising constant is simply Z(0, 7, <p,a) = J exp (— |£7(X, 0, 7, <p, cr)) dX. It 
then follows that 

MlK*'") = p( xV) igfo ex) 6XP (~ °> 7 ' ^ g >) 7) 

where Z(y>, cr) = J exp (— |t/(X, 0, 7, (p, cr)) 7r(0, ~f)dXd0d~f. Now as each of the terms ap- 
pearing in the denominator Z(ip, cr) J\ n jV(X„ ; .|0, C Vn ) will be the same value in the numerator 
and denominator of the Hastings ratio then it cancels out giving that 

p(0, 7 |X, p>, cr) cx exp (- ^U(X, 0, 7, ip, tr)j n(0, 7) (52) 

Since the gradients appear only linearly and their conditional distribution given X is Gaussian they 
can be marginalized exactly. In other words, given observations Y, we can sample from the condi- 
tional distribution for X and marginalize the augmented derivative space. The differential equation 
need never now be explicitly solved, its implicit solution is integrated into the sampling scheme. 
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D. Derivation of Metric Tensor for Nonlinear Differential Equations 

We consider both the ODE parameters and the corresponding error variances 7 independently 
when sampling from the conditional posterior (21) and so consider the Fisher Information for each 
set of variables separately. Let us first of all consider 0. The sampling density is p(K\0, 7, (p, cr) 
and so the log-likelihood takes the form 

£(X, 0, 7, <p, cr) = ~U(X, 0, 7, v , cr) - log (Z(0, 7, <p, cr)) (53) 
D. 1 . Fisher Information for ODE parameters 

A straightforward result shows that the Fisher Information matrix for the above generic form of 
likelihood is 

J e = E { V e £V e £ T } - X -E {v e t/V e (7 T } - ±E {V e U} E { V e C/ T } (54) 

where the expectation is taken with respect to p(K\0, 7, <p, cr). Now the required derivative vector 
is VgU = J2 n =i F «( K n + l7n) _1 ( r 'n ~ m «) where the D x T matrix F„ has elements F„ d s = 
df n (0, X, t = s)/dO d . It then follows that 

1 N 

1e = i ^i?{F„(K„ + I 7ll )- 1 (f„-m„)(f„-m Il ) T (K„ + I 7ll )- 1 FT}- 

n=l 

N N 

{F„(K„ + I-yn)- 1 ^ - m„)} E {( f n ~ m„) T (K„ + I^Fl} 

n—l n—1 

An exact analytic form does not follow due to the nonlinearity of the function J7(X, 6, 7, (p, cr) 
therefore at this stage estimates of the Fisher Information must be made by sampling from the 
density p(K\0, 7, <p, cr) or approximations must be made. 

Here we make approximations by employing a surrogate sampling density for p(K\0, 7, <p, cr) 
over the random vectors m„ which is A/"(m„|f„(X, 0, t), K„ + l7„) for each n. One further 
approximation we make is that the elements of the matrices F„ are constant relative to the vectors 
f„ and m„. With these in place and noting that E{f n — m„} = and E{(f n - m„)(f„ — m„) T } 
K , + Ly„ under the surrogate density, then 

N 
n=l 

Whilst approximations have been made to arrive at this convenient analytic form for the Fisher 
Information it should be highlighted that in terms of employing this expression as a metric tensor 
the elements are in place to describe approximately the local geometric structure given that the 
elements of each matrix F„ are the time derivatives of the sensitivity coefficients of the systems of 
ODEs. However we discuss the implications of this approximation further in Section 7.3. 

D.2. Fisher Information for Model Mismatch Variance 7 

The Fisher Information for each y 7 ^, which we denote by S n , is represented as 
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As the derivative dU n /dS n = -S n (f n - m„) T (K„ + l7 n ) _1 (f„ - m„) then 

Is n = l n E |(f„ - m„) T (K„ + l7 n ) _1 (f„ - m„)(f„ - m„) T (K„ + l7„) _1 (f„ - m„)| - 

- m„) T (K„ + l7„) 1 (f n - m n )| 

As in the previous section we approximate the sampling density p(K\9, 7, ip, er) by the surro- 
gate Gaussian A/"(m„|f„(X, 0, t), K„ + Ij n ), in which case after some standard manipulation 

l Sn « 2 7 „trace ((K„ + I 7 „)- 1 (K„ + 1 7 „)" 1 ) (56) 

E. Metric Tensor & Derivative of Gaussian Process Regression Model 

The marginal likelihood for a linear regression model under a GP prior is defined in the main text 
as p(y\<p,cr) = Af(y\0, K), where K = <r 2 I + C v and denoting the full set of parameters as 
4> = (ip, <t) then the derivative of the log-marginal and the metric tensor - Fisher Information 
matrix - follow in standard form as 



_8_ 

d(f>., 



log (p(y|0)) = itrace ( (K^yy 7 ^ 1 - KT 1 ) (57) 

G(^), = itrace(K-^K-^) (58) 

Application of standard derivative of trace operators provides an analytic expression for the 
derivative of the metric tensor with respect to the parameters 



d<j>k d(f) k 



(59) 



In our experiments we employ an infinitely differentiable stationary covariance function to cal- 
culate the (i,j) th entry of the covariance matrix, 

K itj = (px exp ^- 2~| (*J ~ ^) 2 ^) + ( 6 °) 

The Fisher Information matrix above may therefore be obtained using the first and second partial 
derivatives of the covariance function. The first partial derivatives follow as, 

The second partial derivatives may also be easily calculated, and indeed out of the nine second 
partial derivatives, only three of them are non-zero which eases their computation. 

d 2 K.ij d 2 K.ij d 2 K.{j d 2 K.ij d 2 K.{j d 2 K.ij 



dip\ dipida d(fl2da dcrdtpi dad(f2 da 2 

M d2 ^l_ d ^iAfl [l _^ )lt ._ ti) 2 



dipid(p 2 tpi d(p 2 ' d(p 2 dipi ifl 



